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Yale University and University of Florida 

We consider situations in Bayesian analysis where we have a fam- 
ily of priors on the parameter 6, where h varies continuously over 
a space H, and we deal with two related problems. The first involves 
sensitivity analysis and is stated as follows. Suppose we fix a func- 
tion / of 6. How do we efficiently estimate the posterior expectation 
of f{0) simultaneously for all h in H? The second problem is how 
do we identify subsets of H which give rise to reasonable choices 
of i^h? We assume that we are able to generate Markov chain sam- 
ples from the posterior for a finite number of the priors, and we de- 
velop a methodology, based on a combination of importance sampling 
and the use of control variates, for dealing with these two problems. 
The methodology applies very generally, and we show how it ap- 
plies in particular to a commonly used model for variable selection in 
Bayesian linear regression, and give an illustration on the US crime 
data of Vandaele. 

1. Introduction. In the Bayesian paradigm we have a data vector Y with 
density pe for some unknown 9 £ Q, and we wish to put a prior density on 9. 
The available family of prior densities is {vh^hGTi}, where h is called a hy- 
perparameter. Typically, the hyperparameter is multivariate and choosing 
it can be difficult. But this choice is very important and can have a large 
impact on subsequent inference. There are two issues we wish to consider: 

(A) Suppose we fix a quantity of interest, say, f{9), where / is a func- 
tion. How do we assess how the posterior expectation of f{9) changes as we 
vary h? More generally, how do we assess changes in the posterior distribu- 
tion of f{9) as we vary /i? 

(B) How do we determine if a given subset of Ti constitutes a class of 
reasonable choices? 
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The first issue is one of sensitivity analysis and the second is one of model 
selection. 

As an example of the kind of problem we wish to deal with, consider the 
problem of variable selection in Bayesian linear regression. Here, we have 
a response variable Y and a set of predictors Xi , . . . , Xq , each a vector of 
length m. For every subset 7 of {1,. . . ,q} we have a potential model 
given by 

Y = l„,/3o + X^P^ + £, 

where Im is the vector of m I's, is the design matrix whose columns con- 
sist of the predictor vectors corresponding to the subset 7, is the vector of 
coefficients for that subset, and e ~ ^^^(0, o"^/). Let q-y denote the number 
of variables in the subset 7. The unknown parameter is 9 = {'y,a,l3Q,/3^), 
which includes the indicator of the subset of variables that go into the linear 
model. A very commonly used prior distribution on 6 is given by a hierarchy 
in which we first choose the indicator 7 from the "independence Bernoulli 
prior" — each variable goes into the model with a certain probability w, inde- 
pendently of all the other variables — and then choose the vector of regression 
coefficients corresponding to the selected variables. In more detail, the model 
is described as follows: 

(1.1a) Y^M.milmPo + X^^^,a^I), 

(a2,/3o)~p(a2,/3o)ocl/a2; 

(1.1b) 

given (T,/3^~AAg^(0,5a2(X;X^)-i), 
(1.1c) J w^^ {1 - w)'>-i^ . 

The prior on (cr, /3o, /3-y) is Zellner's g-phor introduced in Zellner (1986), 
and is indexed by a hyperparameter g. Although this prior is improper, the 
resulting posterior distribution is proper. 

Note that we have used the word "model" in two different ways: (i) a model 
is a specification of the hyperparameter h, and (ii) a model in regression is 
a list of variables to include. The meaning of the word will always be clear 
from context. 

To summarize, the prior on the parameter 6 = (7, a, /3q,P^) is given by the 
two-level hierarchy (1.1c) and (1.1b), and is indexed hy h = {w,g). Loosely 
speaking, when w is large and g is small, the prior encourages models with 
many variables and small coefficients, whereas when w is small and g is 
large, the prior concentrates its mass on parsimonious models with large 
coefficients. Therefore, the hyperparameter h = {w,g) plays a very impor- 
tant role, and in effect determines the model that will be used to carry out 
variable selection. 

A standard method for approaching model selection involves the use of 
Bayes factors. For each h £ H, let mh{y) denote the marginal likelihood 
of the data under the prior z//j, that is, mh{y) = J pg{y)i'h{6) d9. We will 
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write m-h instead of mh{y). The Bayes factor of the model indexed by /i2 vs. 
the model indexed by hi is defined as the ratio of the marginal likelihoods 
of the data under the two models, m/ij/j^/ii, and is denoted throughout 
by B(h2,hi). Bayes factors are widely used as a criterion for comparing 
models in Bayesian analyses. For selecting models that are better than others 
from the family of models indexed hy h£T-L, our strategy will be to compute 
and subsequently compare all the Bayes factors B{h,hi), for all h£T-L, and 
a fixed hyperparameter value hi . We could then consider as good candidate 
models those with values of h that result in the largest Bayes factors. 

Suppose now that we fix a particular function / of the parameter 9; for in- 
stance, in the example, this might be the indicator that variable 1 is included 
in the regression model. It is of general interest to determine the posterior 
expectation Eh{f{6) | y) as a function of h and to determine whether or 
not Eh{f{9) I Y) is very sensitive to the value of h. If it is not, then two in- 
dividuals using two different hyperparameters will reach approximately the 
same conclusions and the analysis will not be controversial. On the other 
hand, if for a function of interest the posterior expectation varies consider- 
ably as we change the hyperparameter, then we will want to know which 
aspects of the hyperparameter (e.g., which components of h) produce big 
changes and we may want to see a plot of the posterior expectations as we 
vary those aspects of the hyperparameter. Except for extremely simple cases, 
posterior expectations cannot be obtained in closed form, and are typically 
estimated via Markov chain Monte Carlo (MCMC). It is slow and inefficient 
to run Markov chains for every hyperparameter value h. Section 2 reviews an 
existing method for estimating Eh{f{9) \ Y) that bypasses the need to run 
a separate Markov chain for every h. The method has an analogue for the 
problem of estimating Bayes factors. Unfortunately, the method has severe 
limitations, which we also discuss. 

In this paper we address the sensitivity analysis and model selection issues 
discussed above. Our approach involves running Markov chains correspond- 
ing to a few values of the hyperparameter, say, hi, . . . ,hk, and using these to 
estimate Eh{f{9) \ Y) for all h€T-L and also the Bayes factors B{h,hi) for 
all h£7i. The difficulty we face is that there is a severe computational bur- 
den caused by the requirement that we handle a very large number of values 
of h. Our approach for estimating large families of posterior expectations 
and Bayes factors is based on a combination of MCMC, importance sam- 
pling, and the use of control variates. The main contribution of this work is 
the development of theory to support the method. This theory can be used 
when dealing with implementation issues. The paper is organized as follows. 
In Section 2 we describe our methodology for estimating Bayes factors and 
posterior expectations, and give statements of theoretical results associated 
with the methodology. In Section 3 we discuss estimation of the variance 
and implementation issues. In Section 4 we return to the problem of vari- 
able selection in Bayesian linear regression, and show how our methodology 
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applies in that model. The Appendix gives proofs of the theorems stated in 
the paper. 

The idea of doing importance sampling using data streams from multi- 
ple densities has been investigated in several papers before. In Vardi (1985), 
GiU, Vardi and Wellner (1988), Geyer (1994), Meng and Wong (1996), Kong 
et al. (2003) and Tan (2004), it is assumed that we have samples from each 
density and that each density is known except for a normalizing constant. 
The objective is to estimate all possible ratios of normalizing constants, 
and expectations of a given function with respect to each of the densities. 
The estimates in all these papers are identical, although the computational 
schemes to obtain them given in these papers are different. Gill, Vardi and 
Wellner (1988) and Tan (2004) obtain the asymptotic distribution of the 
estimates when the samples are i.i.d., and Geyer (1994) gives the asymp- 
totic distribution when the samples are Markov chains satisfying certain 
regularity conditions. 

Our Bayesian framework is the same as the framework described above. 
Let iyh,y denote the posterior density of 6 given Y = y when the prior is Vh. 
The posterior densities i^hj,y are given by %,y(6') = pe{y)vhj{G) /mh^, where 
the functional form Pe{y)^hj{&) is known, but the normalizing constant nih. 
is not. Our perspective is different from that of the previous authors in 
that we are interested in estimation of the ratios mijrrii^^ and of posterior 
expectations J f {9)uh y[9) dO for a very large number of /I's. Consequently, in 
addition to the obvious computational demands for handling many /I's, we 
also have to deal with the fact that we will not have a sample from v^^y for 
every h€T-L, but only from Vhj^y^j = 1, . . . , A:. Thus, we are concerned with 
computational efficiency, in addition to statistical efficiency. These issues are 
discussed in detail in Section 2. 

2. Estimation of Bayes factors and posterior expectations. Suppose that 
we have a sample 9i, . . . ,6n (i.i.d. or ergodic Markov chain output) from the 
posterior density i^hi,y for a fixed hi and we are interested in the posterior 
expectation 



(2.1) E,{fi9)\Y = y)= [ f{9)^^^^yn,^y{9)d9 

for different values of h. Using the fact that 

/ — r~\ 7aV/ i'hi,y[9)d9 -1, 

J Pe[y)'^hAG)/mh^ 

we see that this expectation may be written as 



where the right-hand side of (2.2) does not involve the ratio m/i/m/jj. The 
idea to express J f{9)i'h,y{G) d9 in this way was proposed in a different con- 
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text by Hastings (1970). The right-hand side of (2.2) is the ratio of two 
integrals with respect to i^hi,y, each of which may be estimated from the se- 
quence 9i,. . . ,9n. We may estimate the numerator and the denominator by 

(2.3) -Y.fi9i)[M0^)/l^h,m and -Y,[MOi)/^h,m, 

1=1 1=1 

respectively, and j f {0)vh,y{G) dO is estimated by the ratio of these two quan- 
tities. 

The disappearance of the likelihood function on the right-hand side of (2.2) 
is very convenient because its computation requires considerable effort in 
some cases (e.g., when we have missing or censored data, the likelihood is 
a possibly high-dimensional integral). Note that the second average in (2.3) 
is an estimate of m/i/m/j^, that is, the Bayes factor B{h,hi). Ideally, we 
would like to use the estimates in (2.3) for multiple values of h using only 
a sample from the posterior distribution corresponding to the fixed hyper- 
parameter value hi. But, when the prior i/h differs from greatly, the 
two estimates in (2.3) are unstable because of the potential that only a few 
observations will dominate the sums. Their ratio suffers the same defect. 

A natural approach for dealing with the instability of these simple esti- 
mates is to choose k values hi, . . . ,hk & Ti and in (2.1) replace i'hi,y with 
a mixture X]s=i o.sVhs,y, where > 0, for s = 1, . . . , fe, and Yl^s=i = 1- For 
concreteness, consider the estimate of the Bayes factor. Let T/.y = X]s=i (^s^hs,y, 
and let dg = rrihjmh^ , s = 1, . . . ,k. Note that 

(2.4) B{h, hi)= [ -—^^M-—v.y{e) dO 



and 



/ 



f{e)vn,y{e)de = {B{hM))-' [ m—^^^^^^^^——v.y{d)d9 



(2.5) 

If{o){MO)/E':=i<isi^hM/dsp.y{o)de 
nMO)/ELiasi^hM/dsp.y{e)de 

[These two identities are valid under the condition that fhiG) = whenever 
^hs{0) = for all s.] Suppose that for each I = 1, . . . ,k we have Markov 
chain samples 9f\i = 1, . . . , n/, from the posterior density i'hi,y Letting n = 
"^^^=1 ns, if as = ns/n, then the pooled sample is a stratified sample from u.y. 
Doss (2010) considers the case where the vector d = {d2, ■ ■ ■ ,dky is known. 
In this situation, the right-hand side of (2.4) is the integral of a known 
function with respect to the mixture density V.y. He shows that under certain 
regularity conditions, the estimate of B{h,hi) obtained by replacing the 
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right-hand side of (2.4) by its natural Monte Carlo estimate using the pooled 
sample is consistent and asymptotically normal. 

In virtually all applications, the value of the vector d is unknown. The 
estimates of B{h,hi) and J f {6)vh y{9) dO that we consider in this paper 

are constructed by first forming an estimate d of d, and then using the 
natural Monte Carlo estimates of the integral in (2.4) and of the two integrals 
in (2.5) with d substituted for d. The MCMC scheme we will use involves 
the following two stages: 

Stage 1. Generate samples of^^^i = l,...,Ni, from i^hi,y, the posterior 
density of 9 given Y = y, assuming that the prior is , for each I = 1, . . . ,k, 
and use these N = X]i=i observations to form an estimate of d. 

Stage 2. Independently of stage 1, again generate samples i = 1, re;, 
from i'hi,y, for each I = 1, . . . ,k, and construct the estimate of the Bayes fac- 
tor B{h,hi) based on this second set of re = X]f=i^' observations and the 
estimate of d from stage 1. 

The estimate of d in stage 1 is formed using a method introduced by Vardi 
(1985), and this estimate is discussed in the beginning of Section 2.1. From 
now on, for Z = 1, . . . , A;, we use the notation Ai and a/ to identify the ra- 
tios Ni/N and re/ /re, respectively. 

It is natural to ask why we use two steps of sampling, instead of estimating 
the vector d and B{h,hi) from a single sample. The quantity considered in 
Doss (2010) is 



(2.6) B{h,hud) = }^}_^ 



and it involves the vector d. The estimate considered in the present paper 
is B{h,hi,d), where d is an estimate of d. The variance of B{h,hi,d) turns 
out to be greater than that of B{h,hi,d) (and this is true whether we use 
two steps of sampling or a single step). Thus, the variance decomposes as 
Var(i?(/i, hi,d)) = Var(i?(/i, hi,d)) + Vd, where Vd is the increase in variance 
resulting from using d instead of d. Because we wish to estimate B{h, hi) for 
a large number of /I's and for each h the computational time needed is linear 
in the total sample size, this total sample size cannot be very large. On the 
other hand, d needs to be estimated only once. So if generating the chains 
is not computationally demanding, then one can use very long chains to 
estimate d and so greatly reduce the term Vd- A precise statement regarding 
the benefits of the two-stage scheme would have to take into account the 
cost of computing the typical term in (2.6) and the cost of generating a point 
in the chain, and no such statement can be made at the level of generality 
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considered in this paper. However, in ah the examples we have encountered, 
for fixed computational resources, the two-stage scheme gives estimates with 
considerably smaller variance. We mention here that our theoretical results 
are stated for the two-stage schemes, but these results have analogues for 
the case where a single sample is used to estimate both d and the family of 
Bayes factors B(h,hi),h S H, and these are given in Buta (2010). 

A summary of the main contributions of the present work is as follows: 

(1) We develop a complete characterization of the asymptotic distribu- 
tion of the estimate (2.6) and also of a variant involving the use of control 
variates developed by Doss (2010) for the realistic case where d is estimated 
from stage 1 sampling. Included in our results is an explicit formula for the 
increase in variance resulting from using an estimate of d instead of d itself. 
(This contradicts statements in the literature to the effect that using a ^/n- 
consistent estimate of d rather than d itself does not inflate the variance; 
see our discussion in the Appendix.) 

(2) We develop an analogous theory for the problem of estimating a family 
of posterior expectations Eh{f{6) \ Y = y),h£'H. 

(3) For any of our estimators, the variance is a sum of two components, 
and we discuss how each of these may be estimated. An important problem 
is how to properly select the skeleton points /ii, . . . , /i^, and ideally we would 
like to position these in such a way that the variance is minimized. We show 
how the variance estimates can be used to suggest good sets of skeleton 
points. 

(4) We apply the methodology to the problem of Bayesian variable se- 
lection discussed earlier. In particular, we show how our methods enable us 
to select good values of /i = {w,g) and to also see how the probability that 
a given variable is included in the regression varies with {w,g). 

2.1. Estimation of Bayes factors. Here, we analyze the asymptotic dis- 
tributional properties of the estimator that results if in (2.6) we replace d 
with an estimate. Geyer (1994) proposes an estimator for d based on the 
"reverse logistic regression" method and Theorem 2 therein shows that this 
estimator is asymptotically normal when the samplers used satisfy certain 
regularity conditions. This estimator is obtained by maximizing with respect 
to ^2 , . . . , dfc the log quasi-likelihood 



k Ni 



( 



) 



(2.7) 



1=1 1=1 



•s 



As was mentioned earlier, the estimate is the same as the estimates obtained 
by Vardi (1985), Meng and Wong (1996) and Kong et al. (2003). We assume 
that for all the Markov chains we use a Strong Law of Large Numbers 
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(SLLN) holds for all integrable functions [for sufficient conditions see, e.g., 
Theorem 2 of Athreya, Doss and Sethuraman (1996)]. In the next theorem 
we show that if d is the estimate produced by Geyer's (1994) method, or any 
of the equivalent estimates discussed above, then the estimate of the Bayes 
factor given by 

(2,8) B(nMj)^j:j: "^'^^ 

1=1 i=l T.s=l^sJ^hs{&i )/ds 

is asymptotically normal if certain regularity conditions are met. In (2.8), 
di = 1. 

Before we state the theorem, we need to define the expressions that appear 
in the asymptotic variance. For I = 1, . . . ,k,i = 1, . . . ,ni, let 

,2.9) K - 



E:.i«.i'».(e,")/<i. 

(the Y-i/s depend on h, but this dependence is suppressed to lighten the 
notation), and let 

oo k 

Tf{h) = Var(yi,;) + 2^Cov(yi,i,yi+g,z), T\h) = Y,(^iTf{h). 

9=1 1=1 

Also, let c{h) be the vector of length k — \ for which the (j — l)th coordinate 
is 

B{h,hi) f aj^h,{e) 

J L.s=i(^sVhs[G)/ds 

(2.10) 

j = 2,...,k. 



Theorem 1 . Let h ^T-L he fixed. Suppose the chains in stage 2 satisfy 
conditions (Al) and (A2) in Doss (2010): 

\,...,k, the chain is geometrically ergodic. 

\,. . . ,k, there exists e > such that 



(Al) For each I 
(A2) For each I 



(2.11) 



E 



3(0 



2+e 



< OO. 



In the expectation in (2.11), 6\' Vf^^y. Assume also that the chains in 
stage 1 satisfy the conditions in Theorem 2 of Geyer (1994) that imply 

N(d — d) — 7'A/'(0,S). In addition, suppose the total sample sizes for the 
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two stages, N and n, satisfy n — )■ oo, and N ^ oo in such a way that 
n/N-^qe [0,00). Then 

V^{B{h, hiJ) - B{h, hi)) A AA(0, qc{h)'T.c{h) + T'^{h)). 

As alluded to earlier, there are two components to the expression for 
the variance. The first component arises from estimating d, and the sec- 
ond component is the variance that we would have if we had estimated the 
Bayes factor knowing what d is. As can be seen from the formula, the first 
component vanishes if g = 0, that is, if the sample size for estimating the pa- 
rameter d converges to infinity at a faster rate than does the sample size used 
to estimate the Bayes factor. In this case the Bayes factor estimator (2.8) 
using the estimate d has the same asymptotic distribution as the estimator 
in (2.6) which uses the true value of d. Otherwise, the variance of (2.8) is 
greater than that of (2.6), and the difference between the variances depends 
on the parameter q. This parameter is determined by the user and should be 
chosen in such a way as to minimize the variance given computer resources; 
this is discussed in Section 3. 



2.2. Estimation of Bayes factors using control variates. Recall that we 
have samples 9f\i = 1, . . . ,ni, from Uf^^yJ = I, ■ ■ . ,k, with independence 
across samples (stage 2 of sampling) and that, based on an independent set 
of preliminary MCMC runs (stage 1 of sampling), we have estimated the 
constants d2, ■ ■ ■ ,dk- Also, ni/n = ai and n = X^fL;^ n;. Let 

P.12) Y(S)= -'■"'> 

Recalling that v.y := X^^^^ ^si^hs,yi have E-p,y{Y{6)) = B{h, hi), where the 
subscript V.y to the expectation indicates that 9 ~ V.y. Also, for j = 2, . . . ,k, 
let 



(2.13) Z(J)(^) 



(2.14) 



Uh^{e)/dj-UhA<^) 

^hj,y{0) - l^hi,y{0) 



Expression (2.14) shows that £^.^(Z(j)(6i)) = 0. This is true even if the pri- 
ors Uhj and are improper, as long as the posteriors Vhj^y and Vhi^y are 
proper, exactly our situation in the Bayesian variable selection example of 
Section 1. On the other hand, the representation (2.13) shows that Z^^\0) 
is computable if we know the dj 's — it involves the priors and not the pos- 
teriors. [A similar remark applies to (2.12).] Therefore, if as in Doss (2010) 
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we define for I = 1, . . . ,k,i = 1, . . . ,ni 



(2.15) Ziy = l, Z)^= " ^ ^ \ 3 = 1....X 



then for any fixed /9 = (/32, . . . , /3a:)) 

^ k ni / k \ 

(2-16) ^^ = ^EErM-E/3.4? 

is an unbiased estimate of -B(/i, /ii). The value of /3 that minimizes the vari- 
ance of is unknown. As is commonly done when one uses control variates, 

we use instead the estimate obtained by doing ordinary linear regression 

(7) 

of the response Yi^i on the predictors Z^^ ,j = 2,...,k, and to emphasize 
that this estimate depends on d, we denote it by (3{d). Doss (2010) shows 
that (3{d) converges almost surely to a finite limit, P^^. His Theorem 1 
states that the estimator B^egih, hi) = I'^^^y obtained under the assumption 

that we know the constants ^2 , . . . , dfc , has an asymptotically normal distri- 
bution. As mentioned earlier, d2,. ■ ■ ,dk are typically unknown, and must be 
estimated. Let d2, ■ ■ ■ ,dk be estimates obtained from previous MCMC runs 
and let 

1=1 i=l \ j=2 / 

where Yi^i and Z^''i^ are as in (2.9) and (2.15), except using d for d, and ^{d) is 

the least squares regression estimator from regressing Yi^i on predictors Z^y , 
j = 2,...,k. 

The next theorem gives the asymptotic distribution of this new estimator, 
and before we state it we introduce some notation. Let 

k 

(2-18) Ui,i = Yu-^(3j^ii^Z^f 

i=2 

and let 

00 k 

(2.19) af(/i)=Var(C/i,;) + 2j^Cov([7i,i,[/i+3,0, <^'W = E«'^'W- 

9=1 1=1 

Also, let w{h) be the vector of length k — 1 for which the {t — l)th coordinate 
(t = 2, . . . , A:) is 

B{h,hi) f ati^htjO) , « ^ 
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(2.20) 

j=2 



Theorem 2. Suppose all the conditions from Theorem 1 are satisfied. 
Moreover, assume that R, the k x k matrix defined by 



= l,...,k, 



is nonsingular. Then 



V^i^^^^ - B{h, hi)) A AA(0, qw{h)'T.w{h) + a\h)). 

As mentioned above, for any /3, in (2.16) is an unbiased estimate of 
B{h,hi), which leads to the question of what is the optimal value of /3 to 
use. It is not difficult to see that when each of the sequences {9f^}^J^^ is 
i.i.d., the value of f3 that minimizes the variance of is 

/3opt,i.i.d. ■■= argminVarp.^ (y{9) - ^ /3,z(^)(^)) , 



13 



i=2 



that is, the optimal value is the same whether we have a random sample 
from T/.y or a stratified sample. It is natural to ask whether /3opt, i.i.d. is still 

optimal when the k sequences {9^^}^^^-^ are Markov chains. It turns out that: 

(i) /3opt,i.i.d. is not optimal, 

(ii) using /3opt^i,i.d. can actually increase the variance (when the Markov 
chains mix at significantly different rates, chains that are of the same length 
do not have the same "effective sample sizes," but /3opt, i.i.d. does not reflect 
this fact). 

In our experience, using /3opt,i.i.d.' or, more precisely, the least squares esti- 
mate [which in Doss (2010) was shown to converge almost surely to /3opt, i.i.d.]) 
typically gives a significant reduction in variance. Buta and Doss (2011) 
prove points (i) and (ii) above and also discuss an approach for estimating 
the value of /3 that is optimal in the Markov chain case. 

2.3. Estimation of posterior expectations. In this section we describe 
a method for estimating the posterior expectation of a function / when 
the prior is u^. Let us denote this quantity by 

I^^\h)= [ f{9)uhJ9)d9. 
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Define 



With the view of applying identity (2.5), we note that, assuming a SLLN 
holds for the Markov chains 9f\ l = l,...,k,i = l,...,ni, we have 

k rii k ^ ni 

1 



111. IK r<^( 

1=1 i=l 1=1 ' 1=1 

fiOH.,ym 



Y!l=lO'sVhs,y{^) 
= I^f\h) ■ B{h,hi) 

and 

A; n; 



n 

1=1 i=l 



[The li^z's are defined in (2.9); note that Yi^i = Y- j when / = 1.] Letting 

v^fc v^n; ylf] 

(2.21) {h, d) = ^'r , 

we see that /[•^l(/i, d) I^^\h), and replacing d with the estimate d ob- 
tained from stage 1 sampling, we form 



(2.22) i^f^hj) 



EliEZif{0?)Mo?)/{E'=i^siyhM'')/l 
Ei=i TZi Mo?)/{Es=i asi^k. i0?)/ds) 



The following theorem concerns the asymptotic behavior of this estimator, 
and to state it, we first define the expressions that appear in the asymptotic 
variance. Let 



oo 

711 = Var (y ) + 2 ^ Cov(y 'J , F ^_,) , 

9=1 

oo 

712 = 721 = Cov(y[J ,yi,z) + J][Cov(y[J,yi+,,o + Cov(yi,,,y[J^,)], 

9=1 

oo 

722 = Var(yi,0 + 2 J^Cov(yi,«,yi+g,z) 

9=1 
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and 

(2.23) m) = (l'' Z''), m=T.^im)- 

\121 122 J ^ 

Since (2.21) and (2.22) are ratios to which we will apply the delta method, 
we will consider the function g{u,v) =u/v, whose gradient is 'Vg{u,v) = 
{l/v,-u/v^y. Let 

p{h)=Vg{I^-f^h)B{h,hi),B{h,hi)y 

(2.24) 

X T{h)-Vg{I^f^{h)B{h,hi),B{h,hi)). 

Finally, let v{h) be the vector of length k — 1 for which the {j — l)th coor- 
dinate is 

Hh)]j^i= / — 7^ — '-T^h,y{e)de, 

(2.25) 

j = 2,...,k. 



Theorem 3. Suppose the conditions stated in Theorem 1 are satisfied 
and, in addition, for each I = 1, . . . ,k, there exists an e > such that 

(2.26) ^(|y|J|2+")<oo. 
Then 

V^(/[-^l {h, d) - /[^] {h)) A AA(0, qv{hy^v{h) + p{h)). 

The numerator of Tf\h,d) is an estimate of I^f\h)B{h,hi) and the de- 
nominator is an estimate of B{h,hi). It is possible to adjust both the nu- 
merator and denominator through the use of control variates and thus arrive 
at a variant of T^\h,d)] the theory for this is developed in Buta (2010). As 
for the case of estimating the Bayes factors, the variant is not guaranteed 
to give an improvement, but a large improvement is often noted. 

3. Variance estimation and selection of the skeleton points. Estimation 
of the variance of our estimates is important for several reasons. In addition 
to the usual need for providing error margins for our point estimates, vari- 
ance estimates are of great help in selecting the skeleton points. The main 
approaches for estimation of the variance are (i) spectral methods, (ii) meth- 
ods based on batching, and (iii) methods based on regeneration; see Flegal 
and Jones (2010) and Mykland, Tierney and Yu (1995) for a review. Meth- 
ods based on batching are difficult to use in our framework because of two 
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complications, namely, that we are dealing with multiple chains, and we 
have a two-stage scheme; and procedures based on regeneration are often 
difficult to implement. Here we describe a way of estimating the variance 
using spectral methods. 

For the sake of concreteness, consider B{h, hi,d), whose asymptotic vari- 
ance is the expression k (h) = qc{hyT,c{h) +T^ih) (see Theorem 1). The 
term r^(/i) is the asymptotic variance of the quantity B{h,hi,d) in (2.6), 

and since the k Markov chains are independent, T'^{h) = X^fLi o.i'^fih)-, whe- 
re Tf{h) is the asymptotic variance of 

(3.1) 1^ ""-^ 



Now for each / we will estimate Tf{h) by the asymptotic variance of 



(3.2) 



1 1, (d^h 



where d is formed from stage 1 runs. It is not too difficult to show that 
under our asymptotic regime where n/N ^ q £ [0,oo), standard consistent 
spectral estimates of the asymptotic variance of (3.2) are also consistent es- 
timates of the asymptotic variance of (3.1); details are given in Buta and 
Doss (2011). Geyer (1994) gives an expression for S that is explicit enough 
to enable us to estimate it via standard spectral methods. Now, c{h) is 
a vector each of whose components is an integral with respect to the pos- 
terior i'h,y [see (2.10)]. The estimate derived in Section 2.3 [see (2.22)] is 
designed precisely to estimate such posterior expectations. Combining, we 
arrive at an overall estimate of n'^{h), and the asymptotic variances of our 
other estimates are handled similarly. 

Selection of the skeleton points. The asymptotic variances of any of our 
estimates depend on the choice of the points hi,...,hk- For concreteness, 
consider B{h, hi,d), and to emphasize this dependence, let V{h, /ii, . . . , hk) 
denote the asymptotic variance of B{h,hi,d). For fixed hi, . . . ,hk, identify- 
ing the set of /I's for which V{h,hi, . . . ,hk) is finite is typically a feasible 
problem. For instance, Doss (1994) considered the pump data example dis- 
cussed in Tierney (1994), for which the hyperparameter h has dimension 3, 
and determined this set for the case k = 1. He showed that one can go as 
far away from hi as one wants in certain directions, but in other directions 
the range is limited. (The calculation can be extended to any k.) Suppose 
now that we fix a range 7i over which h is to vary. A necessary first step is 
to select hi,. . . ,hk such that V{h, hi, ... , h^) < oo for all h gH. Typically, 
however, we will want more, and we will face the problem below. 
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Design problem: find the values of the skeleton points hi, . . . ,hk that min- 
imize max/ig^ V{h,hi,. . . ,hk). 

Unfortunately, except for extremely simple cases, it is not possible to cal- 
culate V{h, hi, ... , hi^) analytically [even if k = l, V{h, hi) is an infinite sum 
each of whose terms depends on the Markov transition function in a compli- 
cated way] , and maximizing it over hGTi would present additional difficul- 
ties. Furthermore, even if we were able to calculate maxh^n ^ih, hi,. . . , hjS), 
the design problem would involve the minimization of a function of /c x 
dimiT-L) variables, and, in general, solving the design problem is hopeless. 

In our experience, we have found that the following method works rea- 
sonably well. Having specified the range %, we select trial values /ii, . . . , /i^ 
and plot the estimated variance as a function of h, using one of the methods 
described above. If we find a region in % where this variance is unacceptably 
large, we "cover" this region by moving some his closer to the region, or by 
simply adding new /i;'s in that region, which increases k. This is illustrated 
in the example in Section 4. 

The relative lengths of the stages 1 and 2 chains. The parameter q af- 
fects the performance of any of the methods, and the optimal value involves 
a trade-off between time spent calculating density ratios in stage 2 and 
time spent generating the chains in stage 1. Consider, for instance, the esti- 
mate (2.8), whose asymptotic variance is given by Theorem 1 and which we 
will write as K^{h) = qvi{h) + V2{h). In the discussion below, we assume that 
we have run a small pilot experiment that has enabled us to adequately esti- 
mate the components vi{h) and V2{h), and we assume that the total sample 
sizes n and N are both large. The discussion is heuristic in that we assume 
that vi{h) and V2{h) are nearly constant in h. Let ti denote the time it typi- 
cally takes to generate a single step in a chain, let t2 denote the time it takes 
to compute the typical term in (2.8), and let g denote the number of values 
in Ti for which we wish to compute the estimate (2.8). Suppose we are given 
a computational budget of T units of time. For any q G (0,oo), the time it 
takes to compute (2.8) for g values of h is t{q) = {n/q)ti + nti + ngt2, and 
setting this equal to T determines n to be qT/[{q + l)ti + qgt2). The vari- 
ance of the estimate is then V{q) = T~^{vi{h) + V2Q1) / q){{q + l)ti + qgt2). 
Clearly, V{q) is unbounded as q ^ or g — t- 00. The function has a unique 
minimum, which occurs at ^opt = ^/[v2{h)tl]/[vl{h){tl + gt2)]. This last for- 
mula expresses in a usable manner the intuitive notion that if g is large, or 
if the cost of evaluating the density ratios in (2.8) is high relative to the cost 
of running the chains, then a small value of q should be used. 

4. Illustration on variable selection in Bayesian linear regression. There 
exist many classes of problems in Bayesian analysis in which the sensitivity 
analysis and model selection issues discussed earlier arise; see Section 5. 
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Here we give an illustration involving the hierarchical prior used in variable 
selection in the Bayesian linear regression model discussed in Section 1. For 
this model, the parameter is the vector 6 = (7, o", /3o, /^-y), and the prior on 6 
is given by the hierarchy (1.1c) and (1.1b). There exist several MCMC- 
based methods for estimating the posterior distribution of 9 given Y = y, 
and the algorithm we use here is based on the Gibbs sampler of Smith and 
Kohn (1996), which runs on the space of model indicators. Our algorithm, 
developed in Buta (2010), is a Markov chain on 6 that is uniformly ergodic 
and also computationally efficient (it avoids the need for repeated time- 
consuming matrix inversion). It is implemented in the R package bvslr, 
available from http : //www . stat . uf 1 . edu/-ebuta/BVSLR. 

In Sections 1 and 2, and I'h^y refer to the prior and posterior densities, 
and all estimates in Section 2 involve ratios of these prior densities. In the 
Bayesian linear regression model that we are considering here, the priors Vh 
on (7, o", f3o,l3-y) are actually probability measures on {0, 1}'^ x (0, 00) x , 
which in fact are not absolutely continuous with respect to the product of 
counting measure on {0,1}'^ and Lebesgue measure on (0,oo) x 1^"^+^. For 
hi = {wi,gi) and /i2 = {w2,g2), the Radon-Nikodym derivative of u^-^ with 
respect to i'h2 is given by 



where 4>q^{u;a,V) is the density of the g'^-dimensional normal distribution 
with mean a and covariance V, evaluated at u [Doss (2007)]. It is immediate 
that all formulas in Section 2 remain valid if ratios of the form i'h{G)/i^hi{9) 
[see, e.g., equation (2.3)] are replaced by the Radon-Nikodym derivative 
[duh / di'hiji^^) ■ Fortunately, evaluation of (4.1) requires neither matrix inver- 
sion nor calculation of a determinant, so can be done very quickly. Note that 
in view of (4.1), it is not enough to have Markov chains running on the 7's 
and we need Markov chains running on the 0's [or at least (7,0", /3^)]. 

There is a large literature on dealing with the hyperparameter in models 
involving Zellner's g-prior [with or without the variable inclusion line (1.1c)]. 
Some of the proposals involve putting a prior on g, or on both g and w. Liang 
et al. (2008) propose and discuss priors on g; priors on w are generally taken 
to be beta distributions. Other proposals give g as a deterministic function 
of m and q [e.g., g = max{m, q^} in Fernandez, Ley and Steel (2001)]. Liang 
et al. (2008) contains an extensive and critical review of the recommenda- 
tions given in this literature. The most common deterministic choice for w 
is w = 1/2. George and Foster (2000) recommend the empirical Bayes (EB) 



(4.1) 




X 



</>,,(/j^;o,gi(T^(x;x^)-i) 
^,(/37;o,52ct2(x;x^)-i)' 
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approach for estimating the pair {w^g): the marginal hkehhood of {w,g) is 
computed over a grid, and the value of {w,g) that maximizes it is taken as 
the estimate of {w,g). As with many likelihood-based methods, special care 
needs to be taken when the maximizing value is at the boundary. Cui and 
George (2008) give evidence that the EB method outperforms fully Bayes 
methods in this problem. Unfortunately, the EB method is in general com- 
putationally demanding because the likelihood is a sum over all 2^ models 7, 
so it is practically feasible only for relatively small values of q. Our method- 
ology handles this problem by estimating ratios of marginal likelihoods, that 
is, Bayes factors, and, besides giving the maximizing values of w and g, gives 
a plot which shows the behavior of the Bayes factors for a wide range of other 
values of w and g. 

We illustrate our methods on the US crime data of Vandaele (1978), which 
can be found in the R library MASS under the name UScrime. This data set 
seems ideal, because it has been studied in several papers already, so we can 
compare our results with previous analyses, and also because its modest size 
enables a closed- form calculation of the marginal likelihood nih, so we can 
compare our estimates with the gold standard. The data set gives, for each of 
m = 47 states of the USA, the crime rate, defined as number of offenses per 
100,000 individuals (the response variable), and q = 15 predictors measuring 
different characteristics of the population, such as average number of years 
of schooling, average income, unemployment rate, etc. 

To be consistent with what is done in the literature, we applied a log 
transformation to all variables, except the indicator variable. We took the 
baseline hyperparameter to be hi = (w)i,5i) = (0.5, 15), and our goal was to 
estimate B{h,hi) for the 924 values of h obtained when w ranges from 0.1 
to 0.91 by increments of 0.03, and g ranges from 4 to 100 by increments 
of 3. We used (2.17) and this estimate was based on 16 chains each of length 
10,000, corresponding to the skeleton grid of hyperparameter values 

(4.2) {w,g) G {0.3,0.5,0.6,0.8} x {15,50,100,225} 

for the stage 1 samples, and 16 new chains, each of length 1,000, correspond- 
ing to the same hyperparameter values, for the stage 2 samples. The plots 
in Figure 1 give graphs of the estimate (2.17) as w and g vary, from two dif- 
ferent angles. These indicate that values for w around 0.65 and for g around 
20 seem appropriate, while values of w less than 0.3 and values of g greater 
than 60 should be avoided. A side calculation showed that, interestingly, 
for g = max{m, q^} (= 225), the estimate of B{{w,g), (0.65, 20)) is less than 
0.008 regardless of the value of w, so this choice should not be used for this 
data set. With the long chains used and the estimate that uses control vari- 
ates, the Bayes factor estimates in Figure 1 are extremely accurate — root 
mean squared errors are less than 0.04 uniformly over the entire domain of 
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Fig. 1. Estimates of Bayes factors for the US crime data. The plots give two different 
views of the graph of the Bayes factor as a function of w and g when the baseline value 
of the hyperparameter is given by w — 0.5 and g = 15. The estimate is (2.17), which uses 
control variates. 

the plot and considerably less in the convex hull of the skeleton grid (our 
calculation of the root mean squared errors used the closed-form expres- 
sion for the Bayes factors based on complete enumeration). The figure took 
about a half hour to generate on an Intel 2.8 GHz Q9550 running Linux. 
(The accuracy we obtained is overkill and the figure can be created in a few 
minutes if we use more typical Markov chain lengths.) 

Table 1 gives the posterior inclusion probabilities for each of the fifteen 
predictors, that is, -P(7i = 1 | y) for i = 1, . . . , 15, under several models. Line 2 
gives the inclusion probabilities when we use model (1.1) with the values 
w = 0.65 and g = 20, which are the values at which the graph in Figure 1 at- 
tains its maximum. Line 4 gives the inclusion probabilities when the hyper-g 
prior "HG3" in Liang et al. (2008) is used. As can be seen, the inclusion 
probabilities we obtained under the EB model are comparable to, but some- 
what larger than, the probabilities when the HG3 prior is used. This is 
not surprising since our model allows w to be chosen, and the data-driven 
choice gives a value (0.65) greater than the value w = 0.5 used in Liang et al. 
(2008). [Table 2 of Liang et al. (2008) gives a comparison of posterior inclu- 
sion probabilities for a total of ten models taken from the literature.] Line 3 
of Table 1 gives the inclusion probabilities under model (1.1) when we use 
w = 0.5 and the value of g that maximizes the likelihood with w constrained 
to be 0.5. It is interesting to note that the inclusion probabilities are then 
strikingly close to those under the IIG3 model. 

Buta (2010) uses the estimates in Section 2.3 to produce plots of posterior 
inclusion probabilities for several of the predictors, as w and g vary. The 
plots enable one to read the posterior inclusion probabilities under various 
choices for g and w proposed in the literature, and also show that the extent 
to which these probabilities change with the choices is striking. 
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Table 1 

Posterior inclusion probabilities for the fifteen predictor variables in the US crime data 
set, under three models. Names of the variables are as in Table 2 of Liang et al. (2008) 
(but all variables except for the binary variable S have been log transformed) 





Age 


S 


Ed 


ExO 


Exl 




LF 


M 


N 


EB(0.65,20) 


0.93 


0.39 


0.99 


0.70 


0.51 




0.34 


0.35 


0.52 


EB(0.5,20) 


0.85 


0.29 


0.97 


0.67 


0.45 




0.22 


0.22 


0.38 


HG3 


0.84 


0.29 


0.97 


0.66 


0.47 




0.23 


0.23 


0.39 




NW 


Ul 


U2 


W 




X 




Prison 


Time 


EB(0.65,20) 


0.83 


0.40 


0.76 


0.55 




1.00 




0.96 


0.55 


EB(0.5, 20) 


0.70 


0.27 


0.62 


0.38 




1.00 




0.90 


0.39 


HG3 


0.69 


0.27 


0.61 


0.38 




0.99 




0.89 


0.38 



Selection of the skeleton points was discussed at the end of Section 3, and 
we now return to this issue. Consider the Bayes factor estimate based on 
the skeleton (4.2), which was chosen in an ad-hoc manner. The left panel in 
Figure 2 gives a plot of the variance of this estimate, as a function of h. As 
can be seen from the plot, the variance is greatest in the region where g is 
small and w is large. We changed the skeleton from (4.2) to 

(4.3) (u), g) G {0.5, 0.7, 0.8, 0.9} x {10, 15, 50, 100} 

and reran the algorithm. The variance for the estimate based on (4.3) is 
given by the right panel of Figure 2, from which we see that the maximum 
variance has been reduced by a factor of about 9. 




Fig. 2. Variance functions for two versions of I^^^y The left panel is for the estimate 
based on the skeleton (4-S). The points m this skeleton were shifted to better cover the prob- 
lematic region near the back of the plot (g small and w large), creating the skeleton (4-3). 
The maximum variance is then reduced by a factor of 9 (right panel). 
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5. Discussion. The following fact is obvious, but it may be worthwhile 
to state it explicitly. If hi is fixed, maximizing B{h,hi) and maximizing 
the marginal likelihood m/j are equivalent. Choosing the value of h that 
maximizes nih is by definition the empirical Bayes method. Thus, the devel- 
opment in Section 2 can be used to implement empirical Bayes methods. 

Our methodology for dealing with the sensitivity analysis and model se- 
lection problems discussed in Section 1 can be applied to many classes of 
Bayesian models. In addition to the usual parametric models, we mention 
also Bayesian nonparametric models involving mixtures of Dirichlet pro- 
cesses [Antoniak (1974)], in which one of the hyperparameters is the so-called 
total mass parameter — very briefly, this hyperparameter controls the extent 
to which the nonparametric model differs from a purely parametric model. 
[Among the many papers that use such models, we mention in particular 
Burr and Doss (2005), who give a more detailed discussion of the role of the 
total mass parameter.] The approach developed in Sections 2.1 and 2.2 can 
be used to select this parameter. 

When the dimension of h is low, it will be possible to plot B{h,hi), or 
at least plot it as h varies along some of its dimensions. Empirical Bayes 
methods are notoriously difficult to implement when the dimension of the 
hyperparameter h is high. In this case, it is possible to use the methods devel- 
oped in Sections 2.1 and 2.2 to enable approaches based on stochastic search 
algorithms. These require the calculation of the gradient dB{h,hi)/dh. We 
note that the same methodology used to estimate B(h,hi) can also be used 
to estimate its gradient. For example, in (2.8), fhi^^f^) is simply replaced by 

duh{eT)/dh. 

APPENDIX 

Proof of Theorem 1. We begin by writing 

^/^{B{h,hi,d) - B{h,hi)) 

(A.l) ... 

= y/n{B{h, hi,d)- B{h, hi,d)) + ^/T^{B{h, hi,d) - B{h, hi)). 

The second term on the right-hand side of the equation in (A.l) involves 
randomness coming only from the second stage of sampling. This term was 
analyzed by Doss (2010), who showed that it is asymptotically normal, with 
mean and variance t^(/i). The first term ostensibly involves randomness 
from both stage 1 and stage 2 sampling. However, as will emerge from our 
proof, the randomness from stage 2 is of lower order, and effectively all the 
randomness is from stage 1. This randomness is nonnegligible. We mention 
here the often-cited work of Geyer (1994) (whose nice results we use in the 
present paper). In the context of a setup very similar to ours, his Theorem 4 
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states that using an estimated d and using the true d results in the same 
asymptotic variance. From our proof [refer also to the extension of our The- 
orem 1 to the case of a simple sample given in Buta (2010)], we see that this 
statement is not correct. 

To analyze the first term on the right-hand side of (A.l), define the 
function F{u) = B{h,hi,u), where u = (u2, . . . , Ufc)' is a real vector with 
ui > 0,1 = 2, . . . ,k. Then, by the Taylor series expansion of F about d, we 
get 

^/n{B{h,hl,d) - B{h,hi,d)) 
(A.2) = V^{F{d) - F{d)) 

= V^VF{d)'{d -d) + ^{d- d)'V^F{d*){d - d), 

where d* is between d and d. 

First, we show that the gradient VF{d) = {dF{d) / dd2, . . ■ ,dF{d) /ddk)' 
converges almost surely to a finite constant. Recall that c{K) is defined 
in (2.10). For j = 2, . . . ,k, the (j — l)th component of 'VF{d) converges 
almost surely since, with the SLLN assumed to hold for the Markov chains 
used, we have 

^ 1 ^ a,aiz^/,(#)i/ft.(#) as 



1=1 



Next, we show that the random Hessian matrix V'^F{d*) of second-order 
derivatives of F evaluated at d* is bounded in probability. To this end, 
it suffices to show that each element of this matrix, say, [\7'^F{d*)]t~ij-i, 
where t,j G {2, . . .,k}, is Op(l). Since < ||<i-(i|| A 0, it follows that 

d*Ad. 

Let e G (0, min(d2, . . .,dk)). Then we have P{\\d* - d\\ <e)—?-l. We now 
show that, on the set {\\d* — d\\ < e}, V'^F{d*) is bounded in probability. 
Let 

I = md*-d\\<e). 

For t^j, we have 

\[V'F{d*)]t-i,j^i\-I 



ni dYdf{Y.UasV^SOr)ldl? 
^ 2 ^ a^aia,vu{ef^)vU0TyhAeT) 



<y-y- 

{d, - eY{dt - ef[Y.U^syhAe'i')/{ds + e)Y 
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(A.3) 



1=1 ^ 



ajaiatUh^ {9)vht {0)h'hi {9) 
[Y.'=i<^sUhAe)l{ds+e)f 



yh,y{o)de 



{dj-eY{dt-ey 

Note that the expression inside the braces in (A.S) is clearly bounded above 
by a constant, so expression (A.S) is finite. Similarly, for t = j, we can show 
that \[V^F{d*)]j_ij_i\ - I is Op{l). Since P{\\d* - d\\ < e) ^ 1, it follows 
that V'^F{d*) is bounded in probability. Now, by combining (A.l) and (A. 2), 
we obtain 

Vn{B{h,hiJ) - B{h,hi)) 

^VF{d)'VN{d-d) 

+ ^ \/^[^(^~ - d)]'V'F{d*)[^{d - d)] 

+ ^{B{h,hi,d) - B{h,hi)) 

= ^c{hyVN{d -d) + V^{B{h, hi,d)- B{h, hi)) + Op(l), 

where the last line follows from the fact that \7F{d) c{h) established ear- 
lier, the assumptions of Theorem 1 that y^n/N — t- and that y/N {d — d) 
converges in distribution [hence is Op(l)]. Because the two sampling stages 
[for estimating d and B{h, hi)] are assumed to be independent, using the as- 
sumption that ^/N {d — d) -4 A/'(0, S) in conjunction with the result y/n{B{h, 

hi,d)-B{h,hi)) A7V(0,r2(/i)) established in Theorem 1 of Doss (2010) un- 
der conditions (Al) and (A2), we conclude that 

V^{B{h, hiJ) - B{h, hi)) A AA(0, qc{h)'T.c{h) + T'^{h)). □ 

Proof of Theorem 2. We begin by writing 

V^(/|(,) - B{K hi)) = - + V^(I|(,) - B{K hi)), 

where the second term on the right-hand side of (A. 4) was analyzed by 
Doss (2010) who showed that it is asymptotically normal, with mean 
and variance a'^{h). Our plan is to show that (3{d) and (3{d) converge in 
probability to the same limit, which we denote Pn^^- We then expand the 
first term on the right-hand side of (A. 4) by writing 

(A.5) 



i ] 
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Our proof is organized as follows: 

• We note that the third term on the right-hand side of (A. 5) was shown to 
converge to in probability by Doss (2010). 

• We will show that the first term on the right-hand side of (A. 5) also 
converges to in probability. 

• The second term on the right-hand side of (A. 5) involves randomness from 
both stage 1 and stage 2. However, we will show that the randomness from 
stage 2 is asymptotically negligible, and that this term is asymptotically 
equivalent to an expression of the form w{hy{d — d), where w^h) is a de- 
terministic vector. This will show that the second term is asymptotically 
normal. 

Now we prove that the first term on the right-hand side of (A. 5) is Op(l), and, 
to do this, we begin by showing that I3{d) and (3{d) converge in probability 
to the same limit. Let Z be the n x k matrix whose transpose is 



(A.6) Z' = 


/ 1 ■ 

zn ■ 


1 

7(2) 


1 

7(2) 


1 

7(2) 


1 

7(2) 


■ ^ \ 

n-k)k 














y(k) , 



and let Y be the vector 

(A.7) Y = (yi,i, . . . , yi,2, . . . , Yn,,2, • • • , l^l.fc, • • • , l"„„fc)'. 

Let Z be the n x k matrix corresponding to Z when we replace d by d. 
Similarly, Y is like Y, but using d for d. 

For fixed j, j' G {2, . . . , /c}, consider the function 

n ^ _ l - '-^^(^f ) )M' - ^^^(^f ) 

^i=\i=\ 2ls=i<^sVhs{&i l/us 22s=i(^syhs{Gi )/us 

where u = {u2, ■ ■ ■ , Uk)' and ui > 0, for I = 2, . . . ,k. [On the right-hand side 
of (A. 8), ui is taken to be 1.] Note that setting u = d gives 

k rii 
1=1 i=l 

By the mean value theorem, we know that there exists a d* between d and d 
such that 

G{d) = G{d) + VG{d*)'{d-d) = % + VG{d*)'{d-d) + Op(l). 

Note that the last equality above comes from applying the SLLN. An argu- 
ment similar to that used in Theorem 1 to show that V^-F((i*) = Op(l) can 
now be applied to show that VG{d*) = Op{l). 
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Therefore, 

G{d) = %j/ + VG{d*y{d-d) + Op(l) 

= Rjj, + Op(l)op(l) + Op(l) 4 Rjjv. 

Similar arguments extend to the case j = 1 or j' = 1. By the fact that R is 
assumed invertible, we have 

(A.9) n(Z'Z)-iAR-^ 

In a similar way, it can be shown that 

(A.IO) Z'Y/nAv, 

where v is the same limit vector to which Z' Y /n has been proved to converge 
in Doss (2010). Combining (A.9) and (A.IO), we have 

0oid),m) = [n(Z'Z)-i][Z'Y/n] 4 (/3o,iim, Aim) = R^'v. 
Let e{j,l) = E{Z[^}). We now have 

i=2 \i=l i=l ^ ' ^ / 

To show that (A. 11) converges to in probability, it suffices to show that 
for each I and j 

(A.12) „v^f;(fMizfM).o,(i). 

j=l ^ ' ^ 
For fixed j £ {2, . . . ,k} and I £ {1,. . . , k}, define 

H(u) = — — 7^. 

i=l J2s=l'^sl^hs{0i')/Us 

for u = {u2, ■ ■ ■ , Uk)' with > 0, / = 2, . . . , /c, ui = 1. Note that H{d) = 
''^i ^ Sr=ii ■ (A.12) is true, we begin by writing 



j=2 \l=l i=l 

(A.ll) 



i=i ^ ■"' ^ j=l 



(A.13, ^^^±(^1^ 



H{d)-H{d) + Op{l). 
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Note that the fact that 'l27=i(i^i''i "^(i'O]/'^/) = Op{l), which was used 
to estabhsh the second equahty in (A. 13), is proved in Doss (2010). Now, 
applying the mean value theorem to the function H, we know that there 
exists a point d* between d and d such that (A. 13) becomes 

E ( ' J = '^H{d*nd -d)+ Op(i) 



1=1 



(A.14) =^i^^n;'^\Hid*yVN{d-d) 

+ Op{l), 

so that the right-hand side of (A.14) is Op(l). We now consider \/n(li — 
), the middle term in (A.5). Define 

Kf ^-l\-\-(^±^^£^—-S-R ^..(gfVn,-.,,(gf) ^^ 

" 1=1 i=l \ Es=iasl^hs{Sl l/Us j=2 Y.s=l^syhs{^l l/Us J 

where u = {u2, • . . , Uk)' , and ui > for I = 2, . . . ,k. By the Taylor series 
expansion, we have 

Mii^ - iiJ = V^vK{d)'{d-d) 

(A.15) 

+ y/^\{d - d)'V^K{d*){d-d), 

where d* is between d and d. We now consider 'VK{d). For t = 2, . . . ,k we 
have 

[VK{d)]t-i^[w{h)]t.i, 

where [w{h)]t-i was defined in (2.20). The Hessian matrix \7'^K{d*) can be 
shown to be bounded in probability, using an argument similar to the one 
used in the proof of Theorem 1. Therefore, using the fact that V'^K{d*) is 
bounded in probability, we can now rewrite (A. 15) as 

+ ^^VN{d-dyOp{l)VN{d-d) 



= ^w{hyvN{d-d) + op{i). 

Together with (A. 4), this gives 



V^{^^^^-B{h,h,)) = ^w{hyVN{d-d) + V^{il^^^-B{h,hi)) + Op{l) 
A AA(0, qw{hyilw{h) + a'^Qi)) 
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by the independence of the two stages of samphng, the assumption that 
y/N [d — d) is asymptotically normal with mean and variance S , and the 
result from Doss (2010) that V^i^'^i^^-^ — B{h,hi)) is asymptotically normal 
with mean and variance a'^{h). □ 



Proof of Theorem 3. First, we note that 

+ V^{P^{h,d)-i^f\h)). 



(A.16) 



We begin by analyzing the second term on the right-hand side of (A.16), 
which only involves randomness from the second stage of sampling, and show 
that it is asymptotically normal. As for the first term, a closer examination 
reveals that it is also asymptotically normal, with all its randomness coming 
from stage 1 . The asymptotic normality of the sum of these two terms then 
follows immediately from the independence of the two stages of sampling. 
Note that Ez=ia«-^(Wf ) = I^^^h) ■ B{h,hi), and, in particular, when 



/ ^ k ni 



1/2 



B{h,hi). Also, we have 



/ = 1, this gives Yyi=i aiE{Yi^i) 

■P-I^f\h).B{h,hi) 



1=1 i=l 

k 



\ 



1=1 i=l 
k ni 



(I 



(A.17) 



: n 



1/2 



n 



[/]^ 
I > 



1=1 i=l 

k rii 



1=1 

k 



1=1 i=l 



1=1 



/2 . 



ni 



1/2 



E 



1=1 
Yu 



By condition (2.26), assumption (A2) of Theorem 1, and the assumed geo- 
metric ergodicity and independence of the k Markov chains used, the vector 
in (A.17) converges in distribution to a normal random vector with mean 
and covariance matrix T{h) where T{h) is defined in (2.23). Since /'-^^ (/i, d) is 
given by the ratio (2.21), in view of (A.17), its asymptotic distribution may 
be obtained by applying the delta method to the function g{u^ v) = u/v. This 



gives 



^{^f\h,d) - /[/l(/i)) AaA(0,p(/i)), where p{h) is given in (2.24). 
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We now consider the first term on the right-hand side of (A. 16). Define 

for u = {u2, • • . , Uk)' with ui > ior I = 2, . . . ,k. Then 

Y^fc v^n; ylf] 

m = i^f^ {K d) = 

2^1=1 1^1=1 

and ,jn{i^f^ {h, d) - {h, d)) = ^Jn{L{d) - L{d)). Now, by the Taylor series 
expansion of L about d, we get 

V^{P^{h,d) - i^f\h,d)) = ^VL{d)'{d-d) + ^{d-dyV^L{d*){d-d), 

where d* is between d and d. First, we show that the gradient 'VL(d) con- 
verges almost surely to a finite constant vector by proving that each one of 
its components, [L{d)]j-i,j = 2,. . . ,k, converges almost surely. We have 

[VL((i)],-„i ^ j = 2,...,k, 

where [f is given in (2.25). As in the proof of Theorem 1, it can be 

shown that each element of the second-derivative matrix V'^L{d*) is Op{l). 
Now, we can rewrite (A. 16) as 

V^{ff\h,d)-I^f\h)) 

= ^vL{dy^{d -d) + (h, d) - {h)) 

^ '^[^(^^ - dylV^L{d*)[y/N{d - d)] 



2^/N \ N' 

= ^v{hyVN{d-d) + v^iP\h,d) - i^f\h)) + op{i). 

Since the two sampling stages are assumed to be independent, we conclude 
that 

V^(i[^l (/i, d) - /[^] {h)) A AA(0, qv{h)'^v{h) + p{h)). □ 
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SUPPLEMENTARY MATERIAL 

Additional technical details (DOL 10.1214/11-AOS913SUPP; .pdf). We 
show that when estimating the Bayes factors using control variates, the es- 
timate that is optimal when the samples are i.i.d. sequences is no longer 
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optimal when the samples are Markov chains. We also give technical argu- 
ments regarding the consistency of spectral estimates of the variance of our 
estimators. 
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